{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `plot_missing()`: analyze missing values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function `plot_missing()` enables thorough analysis of the missing values and their impact on the dataset. The *impact* is the change in the dataset's characteristics (e.g., the histogram of a numerical column or bar chart of a categorical column) after removing the rows with missing values from the dataset. The following describes the functionality of `plot_missing()` for a given dataframe `df`.\n",
    "\n",
    "1. `plot_missing(df)`: plots the amount and position of missing values, and their relationship between columns\n",
    "2. `plot_missing(df, col1)`: plots the impact of the missing values in column `col1` on all other columns\n",
    "3. `plot_missing(df, col1, col2)`: plots the impact of the missing values from column `col1` on column `col2` in various ways.\n",
    "\n",
    "Next, we demonstrate the functionality of `plot_missing()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the dataset\n",
    "\n",
    "`dataprep.eda` supports **Pandas** and **Dask** dataframes. Here, we will load the well-known [Titanic dataset](https://www.kaggle.com/c/titanic/data) into a Pandas dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-04T11:28:53.682011Z",
     "start_time": "2020-11-04T11:28:49.857747Z"
    }
   },
   "outputs": [],
   "source": [
    "from dataprep.datasets import load_dataset\n",
    "import numpy as np\n",
    "df = load_dataset('titanic')\n",
    "df = df.replace(\" ?\", np.NaN)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get an overview of the missing values with `plot_missing(df)`\n",
    "\n",
    "`plot_missing(df)` will generate four visualizations that lead to different understandings of the missing values in the dataset:\n",
    "1. A statistics table. This table shows the statistics of missing value for the entire dataframe. \"Missing Cell\" represents the total number of missing cells in the whole dataframe. \"Missing Cell (%)\" represents the percent of missing cells in the whole dataframe. \"Missing Columns\" and \"Missing Rows\" represent the number of columns/rows which contain at least one missing cell. \"Avg Missing Cells per Column\" and \"Avg Missing Cells per Row\" represent the average number of missing cells within one column/row.\n",
    "2. A bar chart depicting the amount of missing values in each column. There is an insight tab in the upper right-hand corner, which shows names of the columns and rows containing the most missing values, as well as their missing rate.\n",
    "3. A missing spectrum plot. In this visualization, the dataset is divided into bins, and each bin corresponds to a rectangle in the plot. Then, each rectangle is gray-scaled depending on the number of missing values in the bin. A light colour represents none or few missing values, and a dark colour represents many missing values.\n",
    "4. A nullity correlation heatmap. This visualization depcits how strongly the presence or absence of one variable affects the presence of another. From the Pyhton library [missingno](https://github.com/ResidentMario/missingno): Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).\n",
    "5. The fifth tab displays a dendrogram which allows one to correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmaps. The dendrogram uses a [hierarchical clustering algorithm](https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.\n",
    "\n",
    "The following is an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-04T11:28:58.183565Z",
     "start_time": "2020-11-04T11:28:53.691768Z"
    }
   },
   "outputs": [],
   "source": [
    "from dataprep.eda.missing import plot_missing\n",
    "plot_missing(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the nullity correlation heatmap will be empty if less than two columns are partially missing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understand the *impact* of the missing values in column *x* with `plot_missing(df, col1)`\n",
    "\n",
    "After getting an overview of the missing values with `plot_missing(df)`, we can analyze the impact of the missing values in a specific column `col1` with `plot_missing(df, col1)`. The *impact* of the missing values in column `col1` is the change in the dataset's characteristics after removing the rows where column `col1`'s values are missing. Here, we consider two types of characteristics: the histogram (for numerical columns) and the bar chart (for categorical columns). `plot_missing(df, col1)` plots the histogram or bar chart (for appropriate column types) for each column before and after removing the rows that contain missing values in column `col1`.\n",
    "\n",
    "The following shows an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-07-19T05:45:33.691200Z",
     "start_time": "2020-07-19T05:45:32.809538Z"
    }
   },
   "outputs": [],
   "source": [
    "plot_missing(df, \"Age\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understand the impact of the missing values in column `col1` on column `col2` with `plot_missing(df, col1, col2)`\n",
    "\n",
    "\n",
    "`plot_missing(df, col1)` only displays the frequency distribution of each column before and after removing the rows containing missing values in column `col1`. If the user is specifically concerned with the impact of the missing values in one column `col1` on another column `col2`, she/he can call `plot_missing(df, col1, col2)`. `plot_missing(df, col1, col2)` plots the impact of the missing values in column `col1` on column `col2` in different ways depending on the type of column `col2`.\n",
    "\n",
    "If `col2` is a numerical column, `plot_missing(df, col1, col2)` shows the impact as a histogram, pdf, cdf, and box plot. The following shows an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-07-19T05:52:56.548246Z",
     "start_time": "2020-07-19T05:52:56.216198Z"
    }
   },
   "outputs": [],
   "source": [
    "plot_missing(df, \"Age\", \"Fare\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If `y` is a categorical column, `plot_missing(df, col1, col2)` shows the impact as a bar chart. The following shows an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-07-19T05:53:15.976162Z",
     "start_time": "2020-07-19T05:53:15.848649Z"
    }
   },
   "outputs": [],
   "source": [
    "plot_missing(df, \"Age\", \"Sex\")"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}